Abstract

We bound the number of nearly orthogonal vectors with fixed VC-dimension over $\{-1,1\}^n$. Our bounds are of interest in machine learning and empirical process theory and improve previous bounds by Haussler. The bounds are based on a simple projection argument and they generalize to other product spaces. Along the way we derive tight bounds on the sum of binomial coefficients in terms of the entropy function.

1 Introduction and statement of results

The capacity or "richness" of a function class $F$ is a key parameter which makes a frequent appearance in statistics, empirical processes, and machine learning theory [6, 23, 10, 21, 20, 22, 17, 4]. It is natural to consider the metric space $(F, \rho)$, where $F \subseteq \{-1,1\}^n$ and

\[ \rho(x,y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x_i \neq y_i\}}. \tag{1} \]

A trivial upper bound on the cardinality of $F$ is $2^n$. When $F$ has VC-dimension $d$, the celebrated Sauer-Shelah-Vapnik-Chervonenkis lemma [19] bounds the cardinality of $F$ as

\[ |F| \le \sum_{i=0}^{d} \binom{n}{i}. \tag{2} \]

The notion of cardinality can be refined by considering the packing numbers of the metric space $(F, \rho)$. These are denoted by $M(\varepsilon, d)$, and defined to be the maximal cardinality of an $\varepsilon$-separated subset of $F$; in particular $M(1/n, d) = |F|$. For general $\varepsilon$, the best packing bound for a maximal $\varepsilon$-separated subset of $F$ is due to Haussler [12]. (A discussion of the history of this problem may be found therein.) Haussler's upper bound states that

\[ M(\varepsilon, d) \le e(d+1)\left(\frac{2e}{\varepsilon}\right)^{d}. \tag{3} \]

Figure 1: A comparison of upper bounds (asymptotic behavior as $d \to \infty$).



In this paper, we propose to study the behavior of $M(\varepsilon, d)$ for $\tfrac12 - c < \varepsilon < \tfrac12 + c$ (for constant $c$). As explained below, this corresponds to the case where the vectors of $F$ are close to orthogonal. Our interest in this regime stems from applications in machine learning, where some characterizations and algorithms consider nearly orthogonal or decorrelated function classes [3, 7, 2]. Our main result is Theorem 3.1 (Section 3), which sharpens Haussler's estimate of $M(\varepsilon, d)$ as a function of $d$ and $\varepsilon \approx \tfrac12$.

It is convenient to state our results in terms of $\gamma = 1 - 2\varepsilon$ (thus, for $\varepsilon \approx \tfrac12$ we have $\gamma \approx 0$). We will denote D. Haussler's bound on $|F|$ in (3) by

\[ M((1-\gamma)/2, d) \le DH(\gamma, d) = e(d+1)\left(\frac{4e}{1-\gamma}\right)^{d} \]

and our bound in Theorem 3.1 by

\[ M((1-\gamma)/2, d) \le GKM(\gamma, d) = 100 \cdot 2^{d\beta(\gamma)}, \]

where $\beta : [0,1] \to [2,\infty)$ is defined in (9).



As $d \to \infty$, our bound asymptotically behaves as

\[ \frac{\ln[GKM(\gamma, d)]}{d} \to (\ln 2)\,\beta(\gamma), \]

while Haussler's as

\[ \frac{\ln[DH(\gamma, d)]}{d} \to \ln\frac{4e}{1-\gamma}. \]



Figure 1 gives a visual comparison of these bounds, illustrating the significant improvement of our bound over Haussler's for small $\gamma$.

Our analysis has the additional advantage of readily extending to $k$-ary alphabets, while the proof in [12] appears to be strongly tied to the binary case. In Theorem 4.1 we give what appears to be the first packing bound for alphabets beyond the binary in terms of (a generalized) VC-dimension (but see [1, Lemma 3.3]).

We further wish to understand the relationship between $M(\varepsilon, d)$ and $n$ for fixed $\varepsilon$ and $d$. It is well known [18] that when $\gamma = 1 - 2\varepsilon = O(1/\sqrt{n})$, we have $M(\varepsilon, d) = O(\mathrm{poly}(n))$. Since in many cases of interest [14] the coordinate dimension $n$ may be replaced by its refinement $d_{vc}$, it is natural to ask whether a $\mathrm{poly}(n)$ bound on $M(\varepsilon, d)$ is possible for $\gamma = 1 - 2\varepsilon = O(1/\mathrm{poly}(n))$. We resolve this question in the negative in Theorem 5.1.

Finally, in Section 6 we give a simple improvement of Haussler's lower bound. Haussler exhibits an infinite family $\{F_n \subseteq \{-1,1\}^n\}$ for which $d_{vc}(F_n) = d$ and

\[ M(\varepsilon, d) \ge \left(\frac{1}{2e(\varepsilon + d/n)}\right)^{d}. \tag{4} \]

He notes that the bounds in (3) and (4) leave "a gap from $1/2e$ to $2e$ for the best universal value of the key constant" and poses the closure of this gap as an "intriguing open problem". The gap has recently been tightened to $[1, 2e]$ by Bshouty et al. [5, Theorem 10], in a rather general and somewhat involved argument. Our lower bound in Theorem 6.1 achieves the same tightening via a much simpler construction.



2 Definitions and notation

Our basic object is the metric space $(F, \rho)$, with $F \subseteq \{-1,1\}^n$ and the normalized Hamming distance $\rho$ defined in (1). The inner product

\[ \langle x, y \rangle := \frac{1}{n} \sum_{i=1}^{n} x_i y_i, \qquad x, y \in F, \]

endows $F$ with Euclidean structure. The distance and inner product have a simple relationship:

\[ 2\rho(x,y) + \langle x, y \rangle = 1. \tag{5} \]
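The relationship between the normalized Hamming distance and the normalized inner product can be checked directly on sign vectors. The following is a minimal sketch; the helper names `rho`, `inner` and `random_sign_vector` are illustrative and not taken from the paper.

```python
import random

def rho(x, y):
    """Normalized Hamming distance (1): fraction of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def inner(x, y):
    """Normalized inner product <x, y> = (1/n) * sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

def random_sign_vector(n, rng):
    """A vector in {-1, 1}^n drawn uniformly at random (hypothetical helper)."""
    return [rng.choice((-1, 1)) for _ in range(n)]

rng = random.Random(0)
x = random_sign_vector(16, rng)
y = random_sign_vector(16, rng)

# An agreeing coordinate contributes +1/n to <x, y> and a differing one -1/n,
# so <x, y> = (1 - rho) - rho, which is exactly relation (5).
assert abs(2 * rho(x, y) + inner(x, y) - 1.0) < 1e-12
```

In particular, $\rho(x,y) \approx \tfrac12$ is the same as $\langle x, y\rangle \approx 0$, which is why the regime $\varepsilon \approx \tfrac12$ corresponds to nearly orthogonal vectors.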



We denote the natural numbers by $\mathbb{N} = \{1, 2, \ldots\}$, and for $n \in \mathbb{N}$, we write $[n] = \{0, 1, \ldots, n-1\}$. For $I = (i_1, i_2, \ldots, i_k) \subseteq [n]$, we denote the projection of $F$ onto $I$ by

\[ F|_I = \{(x_{i_1}, \ldots, x_{i_k}) : x \in F\} \subseteq \{-1,1\}^k. \tag{6} \]

We say that $F$ shatters $I$ if $F|_I = \{-1,1\}^k$ and define the Vapnik-Chervonenkis dimension of $F$ to be the cardinality of the largest shattered index sequence $I$:

\[ d_{vc}(F) = \max\{|I| : I \subseteq [n],\ F|_I = \{-1,1\}^{|I|}\}. \]

We define $\gamma = \gamma_{ORT}(F)$ by

\[ \gamma_{ORT}(F) = \max\{|\langle x, y \rangle| : x \neq y \in F\}. \tag{7} \]

In words, $\gamma_{ORT}(F)$ is the smallest $\gamma \ge 0$ such that all distinct pairs $x, y \in F$ are "orthogonal to accuracy $\gamma$". Whenever (7) holds for some $\gamma$, we say that $F$ is $\gamma$-orthogonal.

We will use $\ln$ to denote the natural logarithm and $\log \equiv \log_2$.
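For small classes, the projection, shattering, VC-dimension and $\gamma_{ORT}$ defined in this section can be computed by brute force. The sketch below is only an illustration (it runs in time exponential in $n$), and the function names are assumptions rather than notation from the paper; it also compares $|F|$ with the Sauer-Shelah bound (2).

```python
from itertools import combinations
from math import comb

def project(F, I):
    """Projection F|_I as in (6), returned as a set of tuples."""
    return {tuple(x[i] for i in I) for x in F}

def shatters(F, I):
    """F shatters I when the projection contains all 2^|I| sign patterns."""
    return len(project(F, I)) == 2 ** len(I)

def vc_dimension(F):
    """Brute-force d_vc(F): the largest |I| such that F shatters I."""
    n = len(F[0])
    d = 0
    for k in range(1, n + 1):
        if any(shatters(F, I) for I in combinations(range(n), k)):
            d = k
        else:
            break  # subsets of a shattered set are shattered, so we can stop
    return d

def gamma_ort(F):
    """gamma_ORT(F) as in (7); F is assumed to contain distinct vectors."""
    n = len(F[0])
    return max(abs(sum(a * b for a, b in zip(x, y))) / n
               for x, y in combinations(F, 2))

def sauer_bound(n, d):
    """Right-hand side of (2)."""
    return sum(comb(n, i) for i in range(d + 1))

# Four pairwise orthogonal vectors in {-1, 1}^4.
F = [(1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, -1, -1, 1)]
d = vc_dimension(F)
assert len(F) <= sauer_bound(len(F[0]), d)
```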

3 Upper estimates on nearly orthogonal sets

3.1 Preliminaries: entropy and $\beta$

Recall the binary entropy function, defined as

\[ H(x) = -x \log x - (1-x) \log(1-x). \tag{8} \]

In the range $[0,1]$, this function is symmetric about $x = \tfrac12$, where it achieves its maximum value of 1.

Since $H$ is increasing on $[0, \tfrac12]$, it has a well-defined inverse on this domain, which we will denote by $H^{-1} : [0,1] \to [0, \tfrac12]$. We define the function $\beta : [0,1] \to [2,\infty)$ by

\[ \beta(\gamma) = \frac{1}{H^{-1}[\log(2/(1+\gamma))]}. \tag{9} \]

Figure 2 illustrates the behavior of $\beta$ on $[0, \tfrac12]$.
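Since $H^{-1}$ has no closed form, $\beta$ is most easily evaluated numerically. The sketch below assumes that (9) reads $\beta(\gamma) = 1/H^{-1}[\log(2/(1+\gamma))]$, which is the reading consistent with the stated range $[2,\infty)$, and inverts $H$ on $[0,\tfrac12]$ by bisection. The helpers `gkm_rate` and `dh_rate` are illustrative names for the two asymptotic rates $(\ln 2)\beta(\gamma)$ and $\ln(4e/(1-\gamma))$ from Section 1.

```python
from math import e, log, log2

def H(x):
    """Binary entropy (8) in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def H_inv(h, tol=1e-12):
    """Inverse of H restricted to [0, 1/2], by bisection (H is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def beta(gamma):
    """beta(gamma) = 1 / H^{-1}[log2(2 / (1 + gamma))], for 0 <= gamma < 1."""
    return 1.0 / H_inv(log2(2.0 / (1.0 + gamma)))

def gkm_rate(gamma):
    """Limit of ln[GKM(gamma, d)] / d as d grows."""
    return log(2) * beta(gamma)

def dh_rate(gamma):
    """Limit of ln[DH(gamma, d)] / d as d grows."""
    return log(4 * e / (1 - gamma))

# At gamma = 0 the entropy-based rate is 2 ln 2, below Haussler's ln(4e).
assert gkm_rate(0.0) < dh_rate(0.0)
```

Evaluating both rates on a grid of small $\gamma$ gives a quick way to see where the entropy-based bound is the smaller of the two.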

A sharp bound on $\sum_{i=0}^{d} \binom{n}{i}$ in terms of $H$ is given in Lemma 7.1.

3.2 Main result

Theorem 3.1. Let $F \subseteq \{-1,1\}^n$ with $1 \le d = d_{vc}(F) \le n/2$ and $\gamma = \gamma_{ORT}(F)$. Then

\[ |F| \le 100 \cdot 2^{\beta(\gamma) d}, \]

where $\beta(\cdot)$ is defined in (9).

Figure 2: The function $\beta(\gamma)$.



Proof. Let $r \le n$ be unspecified for the moment and choose $I \subseteq [n]$, $|I| = r$, uniformly at random. Define $\pi = \pi_I$ to be the coordinate projection of $F$ onto $I$ as defined in (6). Let $x$ and $y$ be two uniformly random elements of $F$, and let $A$ be the event that $\pi(x) = \pi(y)$; thus, $P(A)$ is the probability that $x$ and $y$ are mapped to the same vector. The latter is upper-bounded by the sum of the probability that $x$ and $y$ are the same vector, and the probability that $x$ and $y$ are distinct vectors but are mapped to the same vector. The first event occurs with probability exactly $|F|^{-1}$. We claim that the second event occurs with probability at most $(\tfrac12 + \tfrac{\gamma}{2})^r$. To see this, suppose that the two vectors $x, y$ agree on an $\eta$ fraction of the coordinates. Then $\eta \le \tfrac12 + \tfrac{\gamma}{2}$ by (5), and the probability that they agree on one random coordinate is exactly $\eta$. The probability they agree on two coordinates is $\eta(n\eta - 1)/(n - 1)$, and so forth. Thus, the probability that they agree on $r$ coordinates is

\[ \eta \cdot \frac{n\eta - 1}{n - 1} \cdots \frac{n\eta - (r-1)}{n - (r-1)} \le \eta^r \le \left(\tfrac12 + \tfrac{\gamma}{2}\right)^r. \]
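The product above is the probability that $r$ coordinates sampled without replacement all land among the $n\eta$ coordinates on which $x$ and $y$ agree. A small exact-arithmetic sketch (the function name `prob_all_agree` is illustrative) confirms that it equals $\binom{n\eta}{r}/\binom{n}{r}$ and is dominated by $\eta^r$:

```python
from fractions import Fraction
from math import comb

def prob_all_agree(n, agree, r):
    """Exact probability that r coordinates drawn without replacement from n
    all fall among the `agree` coordinates on which x and y coincide.
    Requires r <= n."""
    p = Fraction(1)
    for j in range(r):
        # j-th draw: (agree - j) agreeing coordinates left out of (n - j)
        p *= Fraction(agree - j, n - j)
    return p

n, agree, r = 20, 12, 5
p = prob_all_agree(n, agree, r)
eta = Fraction(agree, n)

assert p == Fraction(comb(agree, r), comb(n, r))  # hypergeometric form
assert p <= eta ** r                              # the bound used in the proof
```

Each factor $(n\eta - j)/(n - j)$ is at most $\eta$ because $\eta \le 1$, which is all the inequality uses.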

By the union bound, we have

\[ P(A) \le |F|^{-1} + \left(\tfrac12 + \tfrac12\gamma\right)^r. \tag{10} \]

As a lower bound on $P(A)$, we claim

\[ P(A)^{-1} \le \sum_{i=0}^{d} \binom{r}{i}. \tag{11} \]



Indeed, if $E$ is any finite set equipped with distribution $P_E$, then the probability of collision (i.e., drawing $e, e' \in E$ independently according to $P_E$ and having $e = e'$) is given by $P_E(e = e') = \sum_{e \in E} P_E(e)^2$. Now by Jensen's inequality,

\[ \frac{1}{|E|} \sum_{e \in E} P_E(e)^2 \ge \left(\frac{1}{|E|} \sum_{e \in E} P_E(e)\right)^{2} = |E|^{-2}, \]

which implies

\[ P_E(e = e') = \sum_{e \in E} P_E(e)^2 \ge |E|^{-1}. \tag{12} \]
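The collision bound is easy to confirm on a concrete distribution. The sketch below, with an illustrative helper `collision_probability`, computes $\sum_e P_E(e)^2$ exactly for an empirical distribution given by integer counts and compares it with $1/|E|$; equality holds for the uniform distribution.

```python
from collections import Counter
from fractions import Fraction

def collision_probability(weights):
    """Sum over e of P(e)^2, where P(e) is proportional to the integer weight of e."""
    total = sum(weights.values())
    return sum(Fraction(w, total) ** 2 for w in weights.values())

# A non-uniform distribution on the 5 distinct letters of a word.
counts = Counter("abracadabra")
assert collision_probability(counts) >= Fraction(1, len(counts))

# The uniform distribution attains the bound |E|^{-1} exactly.
uniform = Counter("abcde")
assert collision_probability(uniform) == Fraction(1, 5)
```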

Let us denote the event that $\pi(x) = \pi(y)$ conditioned on $I$ by $A \mid I$, and write $P_\pi$ for the distribution on $F' := F|_I$ induced by $\pi$. Then we have

\[ P(A \mid I) = \sum_{x' \in F'} P_\pi(x')^2 \ge |F'|^{-1} \ge \left(\sum_{i=0}^{d} \binom{r}{i}\right)^{-1}, \]

where the first inequality is seen by taking $E = F'$ and $P_E = P_\pi$ in (12) and the second holds by Sauer's Lemma (2), since $d_{vc}(F') \le d$. The claim (11) follows by averaging over all the $I$'s.

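Combining (10) and (11) with Lemma 7.1, we get the key inequality

\[ 1.02 \cdot 2^{-rH(d/r)} \le |F|^{-1} + \left(\tfrac12 + \tfrac12\gamma\right)^r, \tag{13} \]

valid for all integer $r \in [2d, n]$. We choose the value

\[ r^* = \lceil \beta(\gamma)\, d \rceil, \]

where the function $\beta(\cdot)$ is defined in (9). It is straightforward to verify from the definition of $\beta(\cdot)$ that for this choice of $r^*$, we have

\[ 2^{-r^* H(d/r^*)} \ge \left(\tfrac12 + \tfrac12\gamma\right)^{r^*} \]

and therefore, by (13),

\[ 0.02 \cdot 2^{-r^* H(d/r^*)} \le |F|^{-1}; \]

since $H \le 1$, this yields

\[ |F| \le 50 \cdot 2^{r^* H(d/r^*)} \le 50 \cdot 2^{\beta(\gamma) d + 1} = 100 \cdot 2^{\beta(\gamma) d}. \]

□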



4 Generalization to k-ary alphabets

Here we extend our upper bound analysis to $k$-ary ($k \ge 3$) alphabets. First, we must generalize the notion of orthogonality. Since two vectors $x, y$ drawn uniformly from $[k]^n$ agree in expectation on $n/k$ coordinates, we may define $\gamma_k(x,y)$ by

\[ \frac{k}{k-1}\,\rho(x,y) + \gamma_k(x,y) = 1, \tag{14} \]

where $\rho$ is the normalized Hamming distance defined in (1). Analogously, we define $\gamma^k_{ORT}(F)$ by

\[ \gamma^k_{ORT}(F) = \max\{|\gamma_k(x,y)| : x \neq y \in F\}. \tag{15} \]

The notion of VC-dimension has various generalizations to $k$-ary alphabets [11, 15, 16, 17]. Among these, we consider Pollard's P(seudo)-dimension, Natarajan's G(raph)-dimension, and the GP-dimension; these are defined in equations (13, 14, 15) of [13], respectively. In the sequel we continue to write $d_{vc}(F)$ to denote one of these combinatorial dimensions, without specifying which one we have in mind. This convention is justified by a common generalized Sauer's Lemma shared by these three quantities, due to Haussler and Long [13, Corollary 3]:

\[ |F| \le \sum_{i=0}^{d_{vc}(F)} \binom{n}{i} k^i. \tag{16} \]

A sharp bound on the right-hand side of (16) is given in Lemma 7.2.
Our main result is readily generalized to $k$-ary alphabets:

Theorem 4.1. Let $F \subseteq [k]^n$ with $\frac{6k}{k+1.6} \le d = d_{vc}(F) \le \frac{kn}{k+1.6}$ and $\gamma = \gamma^k_{ORT}(F)$. Then

\[ |F| \le 34\, k^d\, 2^{d/\delta(\gamma,k)}, \]

where $\delta(\gamma, k)$ is the largest $x \in [0, k/(k+1)]$ for which $x \log k + H(x) \le \log(k/(1 + (k-1)\gamma))$ holds.

Remark: The function $\delta : (0,1) \times \mathbb{N} \to (0,1)$ is readily computed numerically.
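One way to carry out the numerical computation mentioned in the remark is bisection, since $x \log k + H(x)$ is increasing on $[0, k/(k+1)]$ and equals $\log(k+1)$ at the right endpoint, which exceeds the target $\log(k/(1+(k-1)\gamma))$. The sketch below is an illustration under that observation; the names `lhs` and `delta` and the tolerance are assumptions, not part of the paper.

```python
from math import log2

def H(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def lhs(x, k):
    """Left-hand side x*log2(k) + H(x) of the defining inequality."""
    return x * log2(k) + H(x)

def delta(gamma, k, tol=1e-12):
    """Largest x in [0, k/(k+1)] with lhs(x, k) <= log2(k / (1 + (k-1)*gamma)),
    for 0 <= gamma <= 1 and integer k >= 2."""
    target = log2(k / (1 + (k - 1) * gamma))
    lo, hi = 0.0, k / (k + 1)
    # Invariant: lhs(lo) <= target < lhs(hi); lhs is increasing on this range.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lhs(mid, k) <= target:
            lo = mid
        else:
            hi = mid
    return lo

x_star = delta(0.1, 3)
assert 0.0 < x_star < 3 / 4
```

Larger $\gamma$ lowers the target and therefore shrinks $\delta(\gamma,k)$, which in turn weakens the bound $34 k^d 2^{d/\delta(\gamma,k)}$.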

Proof. Repeating the argument in Theorem 3.1 (with the generalized Sauer Lemma (16)), we have

\[ \left(\sum_{i=0}^{d} \binom{r}{i} k^i\right)^{-1} \le |F|^{-1} + \left(\frac{1}{k} + \frac{k-1}{k}\gamma\right)^{r}. \]

Applying the bound in Lemma 7.2 with $r$ in place of $n$ (so that $2 \le d \le \frac{k}{k+1.6}\, r$ and $r \ge 6$), we have

\[ 1.06 \cdot 2^{-rH(d/r) - d\log k} \le |F|^{-1} + \left(\frac{1}{k} + \frac{k-1}{k}\gamma\right)^{r}. \]

Now we seek the minimum integer $r \in [\frac{k+1.6}{k}\, d, n]$ that ensures

\[ d \log k + rH(d/r) \le r \log(k/(1 + (k-1)\gamma)). \]

To this end, we consider the following inequality in $x$:

\[ x \log k + H(x) \le \log(k/(1 + (k-1)\gamma)). \tag{17} \]

Note that the inequality (17) is satisfied at $x = 0$ and define $x^* = \delta(\gamma, k)$ to be the largest $x \in [0, k/(k + 1.6)]$ satisfying it (the proof of Lemma 7.2 shows that the left-hand side of (17) is monotonically increasing in this range). Taking $r^* = \lceil d/x^* \rceil$, we have

\[ 0.06 \cdot 2^{-r^* H(d/r^*) - d\log k} \le |F|^{-1}, \]

which rearranges to

\[ |F| \le 17 \cdot 2^{r^* H(d/r^*) + d\log k} \le 34\, k^d\, 2^{d/\delta(\gamma,k)}, \]

as claimed. □

5 Polynomial upper bounds for small $\gamma$

The bounds of Haussler (3) and Theorem 3.1 obscure the dependence of $|F|$ on its coordinate dimension $n$. It is well known that when $\gamma_{ORT}(F) = O(1/\sqrt{n})$, we have $|F| = O(\mathrm{poly}(n))$. (In the degenerate case $\gamma_{ORT}(F) = 0$, linear algebra gives $|F| \le n + 1$.)

Roth and Seroussi [18] developed a powerful technique for bounding $|F|$ in terms of $n$ and $\gamma$. Let $0 < \rho_{\min} \le \rho_{\max}$ be such that

\[ \rho_{\min} \le n\rho(x, y) \le \rho_{\max} \]

for all distinct $x, y \in F$. Then [18, Proposition 4.1] bounds $|F|$ in terms of $n$, $\rho_a$ and $\rho_g$, where $\rho_a = \tfrac12(\rho_{\min} + \rho_{\max})$ and $\rho_g = \sqrt{\rho_{\min}\rho_{\max}}$. Recalling the relation in (5), we have

\[ \rho_{\max} = \frac{n}{2}(1 + \gamma), \quad \rho_{\min} = \frac{n}{2}(1 - \gamma), \quad \rho_a = \frac{n}{2}, \quad \rho_g = \frac{n}{2}\sqrt{1 - \gamma^2}, \]

which implies the following bound on $|F|$:

\[ 1 - \frac{1}{|F|} \le \frac{n-1}{n} \cdot \frac{1}{1 - \gamma^2}. \]

Note that when $\gamma^2 \ge n^{-1}$, the right-hand side is at least 1 and the bound is rendered vacuous; thus the nontrivial regime is $\gamma^2 < n^{-1}$. In particular, taking $\gamma = 1/(c\sqrt{n})$ for $c > 1$ yields the bound

\[ |F| \le \frac{c^2 n - 1}{c^2 - 1}. \tag{18} \]
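The passage from the bound on $|F|$ to (18) is a short rearrangement that can be checked with exact arithmetic. The sketch below assumes the bound is read as $1 - 1/|F| \le \frac{n-1}{n(1-\gamma^2)}$, equivalently $|F| \le n(1-\gamma^2)/(1-n\gamma^2)$ when $\gamma^2 < 1/n$; the function name is illustrative.

```python
from fractions import Fraction

def orthogonality_size_bound(n, gamma_sq):
    """Upper bound n(1 - g^2) / (1 - n g^2) on |F| for g^2 < 1/n.
    gamma_sq is gamma squared, ideally a Fraction so the arithmetic is exact."""
    if n * gamma_sq >= 1:
        raise ValueError("vacuous regime: need gamma^2 < 1/n")
    return n * (1 - gamma_sq) / (1 - n * gamma_sq)

# gamma = 1 / (c sqrt(n)) with c = 2 and n = 100, so gamma^2 = 1 / (c^2 n).
n, c_sq = 100, Fraction(4)
bound = orthogonality_size_bound(n, 1 / (c_sq * n))

# Agrees with the closed form (c^2 n - 1) / (c^2 - 1) in (18).
assert bound == (c_sq * n - 1) / (c_sq - 1)
```

As $c \to \infty$ (that is, $\gamma \to 0$) the bound tends to $n$, and as $c \to 1$ it blows up, matching the remark that the estimate becomes vacuous at $\gamma^2 = n^{-1}$.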

Since in many situations, the VC-dimension $d_{vc}$ is a refinement of the coordinate dimension $n$, it is natural to ask if a bound similar to (18) holds with $d_{vc}$ in place of $n$. We resolve this question strongly in the negative:

Theorem 5.1. Let $\alpha > 0$ be some constant. Then there are infinitely many $n \in \mathbb{N}$ for which there is an $F \subseteq \{-1,1\}^n$ such that

(a) $\gamma = d^{-\alpha}$

(b) $|F| \ge \exp\left(c\, n^{1/(2\alpha+1)}\right)$

where $\gamma = \gamma_{ORT}(F)$, $d = d_{vc}(F)$ and $c$ is an absolute constant.

Proof. Let $\mathbf{F}$ be an $m \times n$ matrix whose entries are independent symmetric Bernoulli $\{-1,1\}$ random variables; we shall identify the rows of $\mathbf{F}$ with the functions in $F$. Then for distinct $f, g \in F$, we have

\[ E\langle f, g \rangle = 0 \]

and by Chernoff's bound

\[ P\{|\langle f, g \rangle| > \gamma\} \le 2\exp(-n\gamma^2/2) \]

for all $n \in \mathbb{N}$ and $\gamma > 0$. The union bound implies that for $n$ large enough there exists an $F \subseteq \{-1,1\}^n$ with $\gamma_{ORT}(F) \le \gamma$ and

\[ |F| = \lfloor \exp(n\gamma^2/4) \rfloor. \]

The claim follows from the relation

\[ d = d_{vc}(F) \le \log_2 |F| \le n\gamma^2/(4\ln 2) \]

and our choice of

\[ \gamma = d^{-\alpha}. \]

□
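The probabilistic construction in this proof can be imitated by rejection sampling: draw $m = \lfloor \exp(n\gamma^2/4) \rfloor$ random sign vectors and retry until all pairwise inner products are at most $\gamma$ in absolute value. The sketch below is only an illustration; the function names, the seed and the retry limit are assumptions, and the retry loop stands in for the existence argument given by the union bound.

```python
import math
import random
from itertools import combinations

def gamma_ort(rows):
    """Largest |<x, y>| over distinct pairs of rows (normalized by n)."""
    n = len(rows[0])
    return max(abs(sum(a * b for a, b in zip(x, y))) / n
               for x, y in combinations(rows, 2))

def sample_nearly_orthogonal(n, gamma, seed=0, max_tries=50):
    """Draw m = floor(exp(n * gamma^2 / 4)) uniform sign vectors, retrying until
    gamma_ort <= gamma. Returns None if every attempt fails."""
    rng = random.Random(seed)
    m = math.floor(math.exp(n * gamma ** 2 / 4))
    for _ in range(max_tries):
        rows = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(m)]
        if m < 2 or gamma_ort(rows) <= gamma:
            return rows
    return None

F = sample_nearly_orthogonal(n=100, gamma=0.4)
assert F is not None and gamma_ort(F) <= 0.4
```

With $n = 100$ and $\gamma = 0.4$ this produces $\lfloor e^4 \rfloor = 54$ vectors, and since $d_{vc}(F) \le \log_2 |F|$ the VC-dimension of the sample is at most 5.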

An alternative estimate may be obtained via the Gilbert-Varshamov bound [9, 24].



6 A lower bound on the universal constant $c_0$

Haussler's upper (3) and lower (4) bounds imply the existence of a universal $c_0$ for which the packing number $M(\varepsilon, d)$ grows as $\Theta((c_0/\varepsilon)^d)$ in $\varepsilon$ for constant $d$. More precisely,

(i) $M(\varepsilon, d) = O(d(c_0/\varepsilon)^d)$ for all $n$, $F \subseteq \{-1,1\}^n$ with $d_{vc}(F) = d$

(ii) $M(\varepsilon_n, d_n) = \Omega((c_0/\varepsilon_n)^{d_n})$ for some infinite family $(\varepsilon_n, d_n, F_n \subseteq \{-1,1\}^n)$ with $d_{vc}(F_n) = d_n$.

The bounds in (3, 4) peg $c_0$ at $1/2e \le c_0 \le 2e$. An improved lower bound of $c_0 \ge 1$ may be obtained essentially "for free" (cf. [5, Theorem 10]):

Theorem 6.1. There exists an infinite family $(\varepsilon_n, d_n, F_n \subseteq \{-1,1\}^{2^n})$ for which

(a) $d_{vc}(F_n) = d_n$

(b) $M(\varepsilon_n, d_n) = (1/\varepsilon_n)^{d_n}$

Proof. For $n = 1, 2, \ldots$, put $\varepsilon_n = \tfrac12$, $d_n = n$, and $F_n \subseteq \{-1,1\}^{2^n}$ to be the rows of $H_{2^n}$, the Hadamard matrix of order $2^n$. The latter may be defined recursively via

\[ H_1 = [1] \]

and

\[ H_{2^{n+1}} = \begin{bmatrix} H_{2^n} & H_{2^n} \\ H_{2^n} & -H_{2^n} \end{bmatrix}. \]

It is well known (and elementary to verify) that $d_{vc}(F_n) = n$ and that $\gamma_{ORT}(F_n) = 0$. Thus $F_n$ is a $\tfrac12$-separated set of size $2^n$. □
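The two "elementary to verify" facts can be checked by machine for small $n$. The sketch below builds the rows of $H_{2^n}$ by the recursion above and tests pairwise orthogonality and shattering by brute force; the function names are illustrative.

```python
from itertools import combinations

def sylvester_hadamard(n):
    """Rows of H_{2^n}, built from H_1 = [1] via H -> [[H, H], [H, -H]]."""
    H = [(1,)]
    for _ in range(n):
        top = [row + row for row in H]
        bottom = [row + tuple(-v for v in row) for row in H]
        H = top + bottom
    return H

def is_pairwise_orthogonal(rows):
    """True when every pair of distinct rows has inner product 0."""
    return all(sum(a * b for a, b in zip(x, y)) == 0
               for x, y in combinations(rows, 2))

def some_set_shattered(rows, k):
    """True if some k coordinates exhibit all 2^k sign patterns among the rows."""
    width = len(rows[0])
    return any(len({tuple(r[i] for i in I) for r in rows}) == 2 ** k
               for I in combinations(range(width), k))

F3 = sylvester_hadamard(3)          # 8 vectors in {-1, 1}^8
assert is_pairwise_orthogonal(F3)   # gamma_ORT(F_3) = 0
assert some_set_shattered(F3, 3)    # d_vc(F_3) >= 3
assert not some_set_shattered(F3, 4)  # and 8 rows cannot realize 16 patterns
```

Orthogonality means every pair of distinct rows differs in exactly half of the coordinates, which is the $\tfrac12$-separation used in the proof.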



7 Technical Lemmata

Our main result in Theorem 3.1 requires a sharp estimate on the sum of the binomial coefficients. It is well known [8] that for $d \le n/2$, $\sum_{i=0}^{d} \binom{n}{i} \le 2^{nH(d/n)}$, but we need to obtain a slightly tighter bound.

Lemma 7.1. For $1 \le d \le n/2$, we have

\[ \sum_{i=0}^{d} \binom{n}{i} \le \delta \cdot 2^{nH(d/n)}, \]

where $\delta = 0.98$.

Remark: The bound $\delta$ can be further tightened, at the expense of a more complicated proof. Note however that when $d = n/2$ the summation is at least $\tfrac12 \cdot 2^{nH(d/n)}$, so $\delta$ cannot be taken as a constant better than $\tfrac12$.

Proof. Recall Stirling's approximation $i! = \sqrt{2\pi i}\,(i/e)^i e^{\lambda_i}$, where $\frac{1}{12i+1} < \lambda_i < \frac{1}{12i}$. Also note that for $0 < i < n$,

\[ \frac{1}{12n} - \frac{1}{12(n-i)+1} - \frac{1}{12i+1} = \frac{-144n^2 + 144ni - 144i^2 - 12n + 1}{(12n)(12n - 12i + 1)(12i + 1)} < 0. \]

Thus,

\[
\begin{aligned}
\binom{n}{i} = \frac{n!}{i!(n-i)!}
&\le e^{\frac{1}{12n} - \frac{1}{12(n-i)+1} - \frac{1}{12i+1}} \sqrt{\frac{n}{2\pi i(n-i)}}\; \frac{n^n}{i^i (n-i)^{n-i}} \\
&\le \frac{1}{\sqrt{2\pi i(1 - i/n)}}\, (i/n)^{-i} (1 - i/n)^{-(n-i)} \\
&= \frac{1}{\sqrt{2\pi i(1 - i/n)}}\, 2^{nH(i/n)}.
\end{aligned}
\]



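We first prove Lemma 7.1 for small values of $d$, in particular $1 \le d \le n/4$. Note that for $i \le d \le n/4$ we have

\[ \binom{n}{i-1} = \frac{i}{n - i + 1} \binom{n}{i} \le \frac{1}{3} \binom{n}{i} \]

and therefore

\[ \sum_{i=0}^{d} \binom{n}{i} \le 1.5 \binom{n}{d} \le \frac{1.5}{\sqrt{2\pi d(1 - d/n)}}\, 2^{nH(d/n)} \le \delta \cdot 2^{nH(d/n)}, \]

where the last step uses $d \ge 1$ and $1 - d/n \ge \tfrac34$, so that $1.5/\sqrt{2\pi d(1 - d/n)} \le 1.5/\sqrt{1.5\pi} < 0.7$.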

We now turn to the case of large $d$, that is $\tfrac{n}{4} < d \le \tfrac{n}{2}$. If $\sum_{i=0}^{d} \binom{n}{i} \le 0.5 \cdot 2^{nH(d/n)}$, then Lemma 7.1 immediately holds, so we may assume that $Z := \sum_{i=0}^{d} \binom{n}{i} > 0.5 \cdot 2^{nH(d/n)}$. We will show that in this case, much of the weight of the sum is distributed among at least $\Omega(\sqrt{n})$ coefficients. We will use this fact in conjunction with the standard entropy argument, see e.g. [8], to obtain the desired result.

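Now, we have for all $i \le d$ (when $\tfrac{n}{4} < d \le \tfrac{n}{2}$),

\[ \frac{1}{Z}\binom{n}{i} \le \frac{1}{Z}\binom{n}{d} \le \frac{1}{\sqrt{2\pi d(1 - d/n)}} \cdot \frac{2^{nH(d/n)}}{0.5 \cdot 2^{nH(d/n)}} \le \frac{4}{\sqrt{\pi n}}. \]

Consider the random vector $(X_1, \ldots, X_n)$ uniformly distributed in $\{x \in \{0,1\}^n : \sum_i x_i \le d\}$. Then for all $0 \le r \le d$ we have

\[ P\Big(\sum_i X_i = r\Big) = \frac{1}{Z}\binom{n}{r} \le \frac{4}{\sqrt{\pi n}} \]

and therefore

\[ P\Big(\sum_i X_i > d - \frac{\sqrt{\pi n}}{8}\Big) \le \frac{\sqrt{\pi n}}{8} \cdot \frac{4}{\sqrt{\pi n}} = \frac{1}{2}, \]

which implies

\[ E\Big[\sum_i X_i\Big] \le d \cdot P\Big(\sum_i X_i > d - \frac{\sqrt{\pi n}}{8}\Big) + \Big(d - \frac{\sqrt{\pi n}}{8}\Big)\Big(1 - P\Big(\sum_i X_i > d - \frac{\sqrt{\pi n}}{8}\Big)\Big) \le d - \frac{\sqrt{\pi n}}{16}. \]

Hence, we obtain

\[
\begin{aligned}
H(X_1, \ldots, X_n) \le nH(X_i) = nH(E[X_i])
&\le nH\Big(\frac{d}{n} - \frac{\sqrt{\pi}}{16\sqrt{n}}\Big) \\
&\le n\Big(H\Big(\frac{d}{n}\Big) - \Big(H\Big(\frac12\Big) - H\Big(\frac12 - \frac{\sqrt{\pi}}{16\sqrt{n}}\Big)\Big)\Big),
\end{aligned}
\]

where the second inequality uses the monotonicity of the binary entropy function $H$ on $[0, \tfrac12]$, and the third uses the concavity of $H$. Noting that the Taylor series expansion of $H(x)$ around $\tfrac12$ is equal to $1 - \frac{1}{2\ln 2}\sum_{j=1}^{\infty} \frac{(1-2x)^{2j}}{j(2j-1)}$, we have that

\[ H\Big(\frac12\Big) - H\Big(\frac12 - \frac{\sqrt{\pi}}{16\sqrt{n}}\Big) = \frac{1}{2\ln 2}\sum_{j=1}^{\infty} \frac{(\pi/(64n))^{j}}{j(2j-1)} \ge \frac{1}{2\ln 2} \cdot \frac{\pi}{64n}, \]

from which we conclude that

\[ H(X_1, \ldots, X_n) \le nH(d/n) - \frac{\pi}{128\ln 2}. \]

Hence, we have

\[ \sum_{i=0}^{d} \binom{n}{i} = 2^{H(X_1, \ldots, X_n)} \le 2^{-\frac{\pi}{128\ln 2}}\, 2^{nH(d/n)} \le 0.98 \cdot 2^{nH(d/n)}, \]

where the first identity holds because $H(Y) = \log|\mathrm{supp}(Y)|$ when $Y$ is uniformly distributed on its support. This completes the proof. □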
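Lemma 7.1 lends itself to a quick numerical sanity check for small $n$. The sketch below evaluates the ratio of $\sum_{i \le d} \binom{n}{i}$ to $2^{nH(d/n)}$ over all admissible pairs with $n \le 60$ and compares the worst case with $\delta = 0.98$; the function names and the range of $n$ are illustrative choices, and the check is of course no substitute for the proof.

```python
from math import comb, log2

def H(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def binomial_sum(n, d):
    """Sum of C(n, i) for 0 <= i <= d."""
    return sum(comb(n, i) for i in range(d + 1))

def lemma_ratio(n, d):
    """Ratio of the binomial sum to 2^{n H(d/n)}; Lemma 7.1 asserts it is <= 0.98."""
    return binomial_sum(n, d) / 2 ** (n * H(d / n))

# Scan every 1 <= d <= n/2 for 2 <= n <= 60.
worst = max(lemma_ratio(n, d)
            for n in range(2, 61)
            for d in range(1, n // 2 + 1))
assert worst <= 0.98
```

At $d = n/2$ the ratio stays above $\tfrac12$, in line with the remark that $\delta$ cannot be improved beyond $\tfrac12$.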

Our extension to $k$-ary alphabets requires the corresponding analogue of Lemma 7.1:

Lemma 7.2. For $2 \le d \le \frac{k}{k+1.6} \cdot n$ and $n \ge 6$, we have

\[ \sum_{i=0}^{d} \binom{n}{i} k^i \le 0.94 \cdot 2^{nH(d/n) + d\log k}. \]



Proof. First note that the derivative of $f(i) = 2^{nH(i/n) + i\log k}$ is $f'(i) = f(i)[\ln(\tfrac{n}{i} - 1) + \ln k]$, so $f(i)$ attains its maximum over the range $0 \le i \le n$ at $i = \frac{k}{k+1} \cdot n$. Further note that since $i \le d \le \frac{k}{k+1.6} \cdot n < \frac{k}{k+1} \cdot n$, we have that

\[ \ln\Big(\frac{n}{i} - 1\Big) + \ln k \ge \ln 1.6 > 0, \]

so $f$ is increasing on $[0, d]$. We break up the analysis into two cases. When $d \le \tfrac{n}{2}$ we have

\[ \sum_{i=0}^{d} \binom{n}{i} k^i \le 2\binom{n}{d} k^d \le \frac{2k^d}{\sqrt{\pi d}}\, 2^{nH(d/n)} = \frac{2}{\sqrt{\pi d}}\, f(d) \le 0.8 \cdot f(d). \]

When $d > \tfrac{n}{2}$ we split the sum at $\lfloor n/2 \rfloor$:

\[ \sum_{i=0}^{d} \binom{n}{i} k^i = \sum_{i=0}^{\lfloor n/2 \rfloor} \binom{n}{i} k^i + \sum_{i=\lfloor n/2 \rfloor + 1}^{d} \binom{n}{i} k^i. \]

The first sum is at most $2f(\lfloor n/2 \rfloor)/\sqrt{\pi \lfloor n/2 \rfloor}$ by the previous case. For the second sum, each term is bounded via the Stirling estimate from the proof of Lemma 7.1, and the resulting sum of $f(i)$ is compared with the integral

\[ \int_{\lfloor n/2 \rfloor + 1}^{d} f(i)\Big[\ln\Big(\frac{n}{i} - 1\Big) + \ln k\Big]\, di = f(d) - f(\lfloor n/2 \rfloor + 1), \]

using the lower bound on $\ln(\tfrac{n}{i} - 1) + \ln k$ above. Combining the two parts gives

\[ \sum_{i=0}^{d} \binom{n}{i} k^i \le 0.94 \cdot f(d). \]

□